Background / Context / Significance: 

Cluster randomized trials (CRTs) have become increasingly common in education as a 
means for evaluating the effectiveness of educational interventions. In fact, since 2002, more 
than 80 CRTs have been funded by the Institute of Education Sciences (IES), the research branch 
of the U.S. Department of Education. The trials examine a broad array of interventions including, 
but not limited to, reading curricula, math curricula, science curricula, professional development 
programs, and social and character development programs. The interventions target students 
from pre-K through post-high school (http://ies.ed.gov/). 

CRTs have become more widespread in evaluations of the effectiveness of educational 
programs and policies for two primary reasons. First, when they are feasible and if they are well 
designed and implemented, randomized trials are the best way to establish causal relationships 
(Boruch, 1997; Boruch, DeMoya, & Snyder, 2002; Cook, 2002). Second, the natural clustering 
in our education system, students within classrooms within schools within districts, and the fact 
that educational interventions are typically delivered at the classroom, school, or district level, 
make CRTs particularly relevant for education studies (Bloom, 2005; Boruch & Foley, 2000; 
Cook, 2005). The goal is that over time the evidence provided by rigorous evaluations of 
educational programs or policies, rigorous being defined as experimental or high-quality quasi- 
experimental studies, will accumulate and transform education into an evidence-based field 
(Whitehurst, 2003). 

However, the sheer presence of CRTs to evaluate educational programs and policies is 
not enough to transform education into an evidence-based field. As noted above, the trials must 
be well-designed and implemented in order to generate high-quality evidence of program 
effectiveness. Although there are many elements involved in the design and implementation of a 
study, we limit the scope of this paper to the statistical power of the study. We focus on power 
because underpowered studies represent a serious threat to the success of CRTs (Boruch, 2005; 
Boruch & Foley, 2000). 

The field has made substantial progress in terms of how to calculate statistical power for 
CRTs for continuous outcomes, such as academic achievement, in the past 15 years. Raudenbush 
(1997) introduced power calculations for a two-level CRT (2-level CRT) and illustrated two key 
points in a power analysis for a CRT: 1) the total number of clusters influences the power more 
than the total number of individuals and 2) the higher the intraclass correlation (ICC), or the 
variability between clusters relative to the total variability, the lower the statistical power. In 
1998, Murray published a book dedicated to the design and analysis of CRTs which was 
followed by another book by Donner and Klar (2000) on the same topic, though specifically 
geared towards the health sciences. Since then, numerous others have contributed by extending 
the work to additional designs, including three-level designs and blocked designs 
(Konstantopoulos, 2008; Raudenbush, Spybrook, & Martinez, 2007; Schochet, 2008). 

The accessibility of planning parameters has also contributed to the improved accuracy of 
power analyses for CRTs in education. Several studies have provided evidence to suggest that 
ICCs for academic achievement are likely to be between 0.15 and 0.25 (Bloom, Richburg-Hayes, 
& Black, 2007; Bloom, Bos, & Lee, 1999; Hedges & Hedberg, 2007; Schochet, 2008). Bloom, 
Richburg-Hayes, and Black (2007) also illustrated the importance of including covariates and 
provided empirical evidence suggesting that for achievement outcomes, pretests could explain 
between 40 and 80 percent of the variation in the outcome. 

However, outcomes of interest are not always continuous in nature. For example, a key 
outcome in education is graduation status. Increasing the number of students who 
graduate from high school is an important national goal in education with long-term implications 
for the future work force. Graduation status, however, is not a continuous outcome but rather a 
binary outcome, i.e., a student either graduates or does not graduate. Binary outcomes rely on 
different assumptions than continuous outcomes; hence the power analysis will necessarily be 
different. 

Power analyses for binary outcomes in single-level designs have been well documented 
(Fleiss, 1981; Hsieh, Block, & Larsen, 1998; Diggle, Heagerty, Liang, & Zeger, 2002). Leon 
(2004) presented power tables for repeated observations of a binary outcome. Murray (1998) 
extended the work from single-level studies to CRTs. The power calculations presented by 
Murray (1998) use the same parameters as those used for continuous outcomes, including sample 
sizes at all levels, difference in the outcome between clusters, and the ICC. Moerbeek, 
VanBreukelen, and Berger (2001) examined the optimal level of randomization and the optimal 
allocation of units when the outcome is binary for two-level CRTs. They used an alternative 
approach in which the power calculations do not include the ICC, a standardized parameter that 
is commonly used in power calculations for continuous outcomes. Instead, the power 
calculations use the unstandardized within-cluster variance and between-cluster variance. 
However, an established framework for power analyses for CRTs with binary outcomes appears to be 
much less developed than for continuous outcomes. 

Purpose / Objective / Research Question / Focus of Study: 

The purpose of this paper is to provide a framework for approaching a power analysis for 
a CRT with a binary outcome. We suggest a framework in the context of a simple CRT and then 
extend it to a blocked design, or a multi-site cluster randomized trial (MSCRT). The framework 
is based on proportions, an intuitive parameter when the outcome is binary. In addition, we 
present sample power tables to give readers some intuition regarding sample sizes for 
CRTs with binary outcomes. 

Statistical Models: 

Following the hierarchical linear modeling (HLM) framework (Raudenbush & Bryk, 
2002), the level-1 model consists of three parts: the sampling model, the link function, and 
the structural model. The level-1 sampling model defines the probability that the event will 
occur. Let $Y_{ij} = 1$ if an event (often called a “success”) occurs and $Y_{ij} = 0$ if not. The sampling 
model is: 

$Y_{ij} \mid \phi_{ij} \sim B(m_{ij}, \phi_{ij})$ [1] 

for $i \in \{1, 2, \ldots, n_j\}$ students per school and for $j \in \{1, 2, \ldots, J\}$ schools; 
where $m_{ij}$ is the number of trials for student i in school j; and 
$\phi_{ij}$ is the probability of success for student i in school j. 

^ Due to the space limitation, we focus on the simple CRT in this proposal. In the full paper, both the CRT and 
MSCRT will be included. 
The expected value and variance of $Y_{ij}$ are: 

$E(Y_{ij} \mid \phi_{ij}) = m_{ij}\phi_{ij}, \qquad \mathrm{Var}(Y_{ij} \mid \phi_{ij}) = m_{ij}\phi_{ij}(1 - \phi_{ij})$ [2] 

Note that in the case of a Bernoulli trial, $m_{ij} = 1$, so the expected value of $Y_{ij} \mid \phi_{ij}$ reduces to $\phi_{ij}$ and 
the variance reduces to $\phi_{ij}(1 - \phi_{ij})$. A common link function for a binary outcome is the logit link: 



$\eta_{ij} = \log\left(\frac{\phi_{ij}}{1 - \phi_{ij}}\right)$ [3] 

where $\eta_{ij}$ is the log odds of success. 

The third part of the level- 1 model is the structural model: 

$\eta_{ij} = \beta_{0j}$ [4] 

where $\beta_{0j}$ is the average log odds of success in school j. 

The level-2 model has the same form as the level-2 model for a 2-level CRT with a 
continuous outcome. However, the interpretation of the parameters differs because of the logit 
link function: 



$\beta_{0j} = \gamma_{00} + \gamma_{01}W_j + u_{0j}, \qquad u_{0j} \sim N(0, \tau)$ [5] 

where $\gamma_{00}$ is the average log odds of success across schools; 
$\gamma_{01}$ is the treatment effect in log odds; 
$W_j$ is 1/2 for treatment and -1/2 for control; 
$u_{0j}$ is the random effect associated with each school mean; and 
$\tau$ is the between-school variance in log odds. 

In combined form, the model is: 

$\eta_{ij} = \gamma_{00} + \gamma_{01}W_j + u_{0j}$. [6] 
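
To make the data-generating process behind the combined model concrete, the following is a minimal simulation sketch. The function name `simulate_crt`, the seed argument, and the rule that the first half of the schools is treated are illustrative assumptions, not part of the framework itself.

```python
import math
import random


def simulate_crt(J, n, gamma00, gamma01, tau, seed=0):
    """Draw binary outcomes from the two-level logit model (combined form).

    Returns one (w_j, successes_j) pair per school, where successes_j is
    the number of students out of n with Y_ij = 1.
    """
    rng = random.Random(seed)
    schools = []
    for j in range(J):
        # Assumption: the first half of the schools is treated (+1/2),
        # the rest are controls (-1/2).
        w = 0.5 if j < J // 2 else -0.5
        # School random effect u_0j ~ N(0, tau); gauss expects a standard deviation.
        u = rng.gauss(0.0, math.sqrt(tau))
        # Log odds shared by every student in school j.
        eta = gamma00 + gamma01 * w + u
        # The inverse logit turns log odds into a success probability.
        phi = 1.0 / (1.0 + math.exp(-eta))
        # Each student is one Bernoulli trial (m_ij = 1).
        successes = sum(rng.random() < phi for _ in range(n))
        schools.append((w, successes))
    return schools
```

Calling `simulate_crt(J=30, n=150, gamma00=0.85, gamma01=0.48, tau=0.26)` returns 30 school-level success counts, which can be divided by n to see how much graduation-type proportions spread across schools for a given $\tau$.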



Power Calculations: 

We use a first-order Taylor series approximation to linearize the model. Under marginal quasi-likelihood (MQL), we 
linearize around the fixed part of equation 6 (Breslow & Clayton, 1993). After linearization, 
the hypothesis testing and power calculations are straightforward. We are interested in 
testing whether the treatment effect is zero, $\gamma_{01} = 0$. Under the null hypothesis, the test statistic follows 
a central t-distribution. Under the alternative hypothesis, the test statistic follows a noncentral 
t-distribution with $J - 2$ degrees of freedom and noncentrality parameter $\lambda$, where 

$\lambda = \frac{\gamma_{01}}{\sqrt{4(\tau + \sigma^2/n)/J}}$ [7] 

The power for a two-sided test is: 

$\text{Power} = 1 - \Phi(t_{\alpha/2,\,J-2} - \lambda) + \Phi(-t_{\alpha/2,\,J-2} - \lambda)$ [8] 

where $\Phi$ is the cumulative distribution function for the t-distribution; and 
$t_{\alpha/2,\,J-2}$ is the critical value under the null hypothesis with $J - 2$ degrees of freedom. 
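
The two-sided power expression can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the Optimal Design implementation: it assumes the noncentrality parameter has the balanced-design form $\lambda = \gamma_{01} / \sqrt{4(\tau + \sigma^2/n)/J}$, it evaluates the central t CDF by Simpson's rule, and every function name is hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of the central t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Central t CDF: 0.5 plus or minus the Simpson's-rule area on [0, |x|]."""
    a = abs(x)
    h = a / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_crit(alpha, df):
    """Two-sided critical value t_{alpha/2, df}, located by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def noncentrality(gamma01, tau, sigma2, n, J):
    """Assumed balanced-design form of the noncentrality parameter."""
    return gamma01 / math.sqrt(4 * (tau + sigma2 / n) / J)


def power_two_sided(lam, J, alpha=0.05):
    """Power = 1 - Phi(t_crit - lam) + Phi(-t_crit - lam), with J - 2 df."""
    df = J - 2
    crit = t_crit(alpha, df)
    return 1 - t_cdf(crit - lam, df) + t_cdf(-crit - lam, df)
```

With $\lambda = 0$ the function returns the type I error rate $\alpha$, and power rises as $\lambda$ grows, which is a quick sanity check on any set of planning values.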



2011 SREE Conference Abstract Template 







As the noncentrality parameter increases, the power increases. Although the power calculations 
appear straightforward, the estimates necessary for the noncentrality 
parameter are in log odds, which is not an intuitive metric. For example, a researcher 
designing a study in which graduation status is the primary outcome of interest is more likely 
to think about the proportion of students graduating than the log odds of students graduating. 
Similarly, in terms of the variability across schools, the researcher is more likely to think about the variability in 
graduation rates across schools than the variability in the log odds of graduation across schools. 
Because a power analysis is only as good as the parameters that are used, we propose to use 
more intuitive parameters to guide the power analysis. 

Usefulness / Applicability of Method: 



The phrase "binary outcome" immediately brings proportions to mind. Thus we use 
proportions to guide the power analyses. We conduct the power analysis from estimates of 
four parameters: the proportion of successes in the treatment group, $\phi_E$; the proportion 
of successes in the control group, $\phi_C$; and a lower and upper bound on the proportion of 
successes in the control group, $\phi_{C_{LB}}$ and $\phi_{C_{UB}}$. Next we describe how the proportions are 
translated into the noncentrality parameter necessary for the power calculations. 

We begin by examining the numerator of the noncentrality parameter, or the difference between 
the treatment and control groups. The proportions of successes in the treatment and control groups 
can easily be converted to log odds using $\eta_E = \log\left(\frac{\phi_E}{1 - \phi_E}\right)$ and 
$\eta_C = \log\left(\frac{\phi_C}{1 - \phi_C}\right)$, such that the difference between $\eta_E$ and $\eta_C$ is the estimate of the 
difference between the treatment and control groups in log odds. The denominator of the 
noncentrality parameter includes two variance components, $\sigma^2$ and $\tau$. Because the outcome is 
binary, the within-school variance is a function of the proportions of successes in the treatment 
and control groups and is easily calculated from estimates of the proportions. The between-cluster 
variance, $\tau$, in the context of a binary outcome or in terms of log odds is not an obvious or 
intuitive parameter for study planners. However, it is more likely that a researcher can estimate 
the lower bound, $\phi_{C_{LB}}$, and upper bound, $\phi_{C_{UB}}$, of a 95 percent plausible value range for the 
proportion of successes among the control schools. Converting these bounds to log odds, 
$\eta_{C_{LB}} = \log\left(\frac{\phi_{C_{LB}}}{1 - \phi_{C_{LB}}}\right)$ and $\eta_{C_{UB}} = \log\left(\frac{\phi_{C_{UB}}}{1 - \phi_{C_{UB}}}\right)$, we can now assume that the log odds follow an 
approximately normal distribution. The midpoint of the interval is $\bar{\eta}_C = (\eta_{C_{LB}} + \eta_{C_{UB}})/2$. A 95 
percent plausible value interval around $\bar{\eta}_C$ is $\bar{\eta}_C \pm 1.96\sqrt{\mathrm{var}(\eta_C)}$. The term 
$\mathrm{var}(\eta_C)$ represents $\tau$, the between-cluster variation among the control schools. Algebraic 
manipulation of the plausible interval reveals that $\tau = \left(\frac{\eta_{C_{UB}} - \bar{\eta}_C}{1.96}\right)^2$. In other words, if the 
researcher can estimate an upper and lower bound on the proportion of successes across control schools, $\tau$ can 
easily be calculated. Hence $\gamma_{01}$, $\sigma^2$, and $\tau$, the three parameters required for the power 
calculations in equation 8, can be calculated from the four proportions, all of which are intuitive 
for researchers designing studies with binary outcomes. 
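
The conversion from four proportions to the three planning parameters can be sketched as follows. The treatment effect and $\tau$ follow the steps described above, with $\tau$ written in the equivalent form (upper log odds minus lower log odds, divided by 2 × 1.96, squared). The text does not give an explicit formula for the within-school variance, so the code assumes $\sigma^2$ is the average of $1/(\phi(1 - \phi))$ over the two arms, which is what a first-order linearization of the logit suggests; treat that line as an assumption. The function names are hypothetical.

```python
import math


def logit(p):
    """Log odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def planning_parameters(phi_e, phi_c, phi_lb, phi_ub):
    """Translate four proportions into (gamma01, sigma2, tau) on the logit scale."""
    # Treatment effect: difference between the two arms in log odds.
    gamma01 = logit(phi_e) - logit(phi_c)
    # Between-school variance: the 95 percent plausible interval spans
    # 2 * 1.96 standard deviations on the log-odds scale.
    tau = ((logit(phi_ub) - logit(phi_lb)) / (2 * 1.96)) ** 2
    # ASSUMPTION (not stated in the text): within-school variance on the
    # linearized logit scale, averaged over the treatment and control arms.
    sigma2 = 0.5 * (1 / (phi_e * (1 - phi_e)) + 1 / (phi_c * (1 - phi_c)))
    return gamma01, sigma2, tau
```

Narrowing the plausible interval shrinks $\tau$ quickly because the bounds enter on the log-odds scale and are then squared, so the interval is worth eliciting carefully from school records.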

Example: 

Suppose that a team of researchers is interested in testing the effectiveness of a new 
stay-in-school campaign. They select a sample of 30 schools to participate in the study. The 
outcome is whether or not a student in 12th grade graduates. On average there are 150 12th 
graders per school. Based on school history, they expect that the graduation rate across schools is 
about 70 percent, with a range from 55 to 90 percent. They believe that the treatment, 
participation in the stay-in-school campaign, will boost graduation rates by 9 percentage points. 

We use Optimal Design V2.0 to produce the power curves. Optimal Design calculates the 
power based on the four proportions as well as the two sample sizes. Figure 1 displays the power 
curve for the example. As you can see, the power increases as the number of schools increases. 
For example, approximately 44 schools (rounded up from 43 assuming equal allocation) would 
be required to achieve power of 0.80. 
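
As a rough hand check on the example, the proportions can be pushed through the conversion with rounded values. The arithmetic below assumes the noncentrality parameter has the balanced-design form $\lambda = \gamma_{01} / \sqrt{4(\tau + \sigma^2/n)/J}$ and, as a further assumption not spelled out in the text, that $\sigma^2$ is the average of $1/(\phi(1 - \phi))$ over the two arms.

```latex
\begin{align*}
\gamma_{01} &= \log\frac{0.79}{0.21} - \log\frac{0.70}{0.30} \approx 1.325 - 0.847 = 0.478 \\
\tau &= \left(\frac{\log\frac{0.90}{0.10} - \log\frac{0.55}{0.45}}{2(1.96)}\right)^2
      \approx \left(\frac{2.197 - 0.201}{3.92}\right)^2 \approx 0.259 \\
\sigma^2 &\approx \frac{1}{2}\left(\frac{1}{0.79(0.21)} + \frac{1}{0.70(0.30)}\right)
      \approx \frac{6.028 + 4.762}{2} = 5.395 \\
\lambda_{J=44} &\approx \frac{0.478}{\sqrt{4(0.259 + 5.395/150)/44}} \approx 2.9
\end{align*}
```

With 42 degrees of freedom the two-sided critical value is about 2.02, so a noncentrality parameter near 2.9 puts power close to 0.80, in line with the roughly 44 schools read off the power curve.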

General Intuitions: 

The example above introduced the intuitive parameters guiding the power analysis and 
showed the power for one specific case. Given that these parameters are likely more accessible 
and intuitive for researchers, we examine how the sample size, probabilities of success in the 
treatment and control conditions, and range of the plausible intervals affect the power. Table 1 
provides the power for a fixed total of 20 clusters and 50 individuals per cluster. We vary the 
probability of success in the treatment and control condition as well as the plausible interval. In 
the full paper, we provide several tables which vary different parameters and examine the 
patterns in statistical power. 

Conclusions: 

Binary outcomes, such as graduation status or retention status, play an important role in 
studies of educational interventions. Designing studies with binary outcomes requires a shift 
from traditional parameters we use in the design of studies with continuous outcomes to 
parameters that are intuitive when dealing with binary outcomes. We propose a framework for 
conducting power analyses based on proportions. We contend that basing power calculations on 
intuitive parameters will strengthen the quality and accuracy of power analyses for CRTs with 
binary outcomes. 
Appendix B. Tables and Figures 




Figure 1. Power curve for 2-level CRT with binary outcome. 

[Figure not reproduced. Plot parameters: $\alpha = 0.050$, $\phi_E = 0.79$, $\phi_C = 0.70$, 
lower plausible value = 0.55, upper plausible value = 0.90, n = 150.] 







Table 1. The power to detect the main effect of treatment given 20 clusters and 50 persons per 
cluster. 

PhiE   PhiC   PI (0.1, 0.9)   PI (0.2, 0.8)   PI (0.3, 0.7)
0.1    0.2    0.30            0.55
0.1    0.3    0.37            0.94            0.99
0.1    0.4    0.89            0.99            0.99
0.1    0.5    0.97            0.99
0.2    0.3    0.16            0.31            0.55
0.2    0.4    0.43            0.76            0.97
0.2    0.5    0.71            0.97            0.99
0.2    0.6    0.91            0.99            0.99
0.3    0.4    0.13            0.23            0.43
0.3    0.5    0.34            0.65            0.93
0.3    0.6    0.63            0.93            0.99
0.3    0.7    0.87            0.99            0.99
0.4    0.5    0.12            0.20            0.38
0.4    0.6    0.32            0.61            0.91
0.4    0.7    0.63            0.93            0.99
0.4    0.8    0.91            0.99
0.5    0.6    0.12            0.20            0.38
0.5    0.7    0.34            0.65            0.93
0.5    0.8    0.71            0.97
0.5    0.9    0.97
0.6    0.7    0.13            0.23            0.43
0.6    0.8    0.43            0.76            0.97
0.6    0.9    0.89
0.7    0.8    0.16            0.31            0.55
0.7    0.9    0.67
0.8    0.9    0.30






